Using the data collected from the Bureau of Transportation Statistics I have determined what airports are the best to fly out of and when to fly in general to have the best chance of avoiding delays for several reasons altogether. Chicago O’Hare International Airport (ORD) was determined to be the worst airport based on the average length of delay (1.13 Hours). ORD also had the second highest proportion of delays due to weather out of all the airports I looked at. The highest proportion of delays due to weather was San Francisco International Airport (SFO). I also determined that September has the lowest proportion of total flights delayed. The holiday months (January, July, November, and December) have the highest likelihood of experiencing delays
Read and format project data
# Learn morea about Code Cells: https://quarto.org/docs/reference/cells/cells-jupyter.html# Include and execute your code hereurl ="https://raw.githubusercontent.com/byuidatascience/data4missing/master/data-raw/flights_missing/flights_missing.json"df = pd.read_json(url)Airport_nan = df['airport_name'].isna().sum()year_nan = df['year'].isna().sum()min_delay_nan = df['minutes_delayed_carrier'].isna().sum()min_delay_nas_nan = df['minutes_delayed_nas'].isna().sum()year_nan
np.int64(23)
Highlight the Questions and Tasks
QUESTION|TASK 1
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”).
I went through and made a new data frame for me to preform some cleaning. I then went through and searched for all the blanks, 1500+ minutes, and values that had -999 in them. then used the “.replace” to get them out and replace them with “NaN”.
Read and format data
# Include and execute your code hereimport pandas as pdimport numpy as npfrom lets_plot import*LetsPlot.setup_html(isolated_frame=True)
::: {#cell-Q1-Clean Data .cell execution_count=4}
Cleaned data
# Include and execute your code heredf_clean = dfdf_clean.replace(["", "1500+", -999], np.nan, inplace=True)example = df_clean.iloc[2].to_json()example
I chose to look at the average time each flight is delayed determining the worst airport delays. In the table you will find the total delayed flights and total delayed minutes for each airport. The portion of delayed flights measure tells us what percentage of flights out of that airport are delayed. The last column displays the average hours each flight is delayed. This helped me determine that Chicago O’Hare International Airport (ORD) is the worst airport to fly out of.
Read and format data
# Include and execute your code hereimport pandas as pdimport numpy as npfrom lets_plot import*LetsPlot.setup_html(isolated_frame=True)
The orst airport based on average delay hours
# Include and execute your code heredf_clean_group = df_cleangrouped_data = df_clean_group.groupby('airport_code').agg( total_flights=('num_of_flights_total', 'sum'), total_delays=('num_of_delays_total', 'sum'), total_delay_minutes=('minutes_delayed_total', 'sum')).reset_index()grouped_data['proportion_delayed_flights'] = (grouped_data['total_delays'] / grouped_data['total_flights']) *100grouped_data['average_delay_hours'] = (grouped_data['total_delay_minutes'] / grouped_data['total_delays']) /60worst_airport = grouped_data.loc[grouped_data['average_delay_hours'].idxmax()]worst_airport
What is the best month to fly if you want to avoid delays of any length?
I chose to look at the proportion of delayed flights to determine what month is the best month to fly. I chose this as my metric because I feel this paints an accurate picture about what the chances are of your flight being delayed dependent on month. From what I found, the best month to fly is September. 16.5% of all the flights that occur in September end up being delayed.
Read and format data
# Include and execute your code hereimport pandas as pdimport numpy as npfrom lets_plot import*LetsPlot.setup_html(isolated_frame=True)
Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild).
I was able to create the new row and add it to the table. The column “weather_related_delays” will be used for the next question.
Read and format data
# Include and execute your code hereimport pandas as pdimport numpy as npfrom lets_plot import*LetsPlot.setup_html(isolated_frame=True)
Weather delays
# Include and execute your code heredf_clean_two = df_cleandf_clean_two['num_of_delays_late_aircraft'] = df_clean_two['num_of_delays_late_aircraft'].replace(-999, np.nan)df_clean_two['num_of_delays_late_aircraft'].fillna(df['num_of_delays_late_aircraft'].mean(), inplace=True)def weather_nas_delays(row):if row['month'] in ['April', 'May', 'June', 'July', 'August']:return0.4*round(row['num_of_delays_nas'], 2)else:return0.65*round(row['num_of_delays_nas'], 2)df_clean_two['weather_related_delays'] = ( df_clean_two['num_of_delays_weather'] +round(0.3* df_clean['num_of_delays_late_aircraft'], 2) + df_clean_two.apply(weather_nas_delays, axis=1))df_clean_two.head(5)
airport_code
airport_name
month
year
num_of_flights_total
num_of_delays_carrier
num_of_delays_late_aircraft
num_of_delays_nas
num_of_delays_security
num_of_delays_weather
num_of_delays_total
minutes_delayed_carrier
minutes_delayed_late_aircraft
minutes_delayed_nas
minutes_delayed_security
minutes_delayed_weather
minutes_delayed_total
weather_related_delays
0
ATL
Atlanta, GA: Hartsfield-Jackson Atlanta Intern...
January
2005.0
35048
NaN
1109.104072
4598
10
448
8355
116423.0
104415
207467.0
297
36931
465533
3769.43
1
DEN
Denver, CO: Denver International
January
2005.0
12687
1041
928.000000
935
11
233
3153
53537.0
70301
36817.0
363
21779
182797
1119.15
2
IAD
NaN
January
2005.0
12381
414
1058.000000
895
4
61
2430
NaN
70919
35660.0
208
4497
134881
960.15
3
ORD
Chicago, IL: Chicago O'Hare International
January
2005.0
28194
1197
2255.000000
5415
5
306
9178
88691.0
160811
364382.0
151
24859
638894
4502.25
4
SAN
San Diego, CA: San Diego International
January
2005.0
7283
572
680.000000
638
7
56
1952
27436.0
38445
21127.0
218
4326
91552
674.70
Weather Delays
QUESTION|TASK 5
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
According the the data San Francisco International Airport (SFO) experiences the highest proportion of delays due to bad weather days. Approximately 9.81% of the flights from SFO are delayed due to weather. Upon further research I found that San Francisco experiences frequent oceanic fog that causes delays and groundings for outbound flights. The next airport to avoid flying out of is Chicago O’Hare International Airport (ORD) with 8.5% of their flights become delayed.
Read and format data
# Include and execute your code hereimport pandas as pdimport numpy as npfrom lets_plot import*LetsPlot.setup_html(isolated_frame=True)
Weather delays by airport
# Include and execute your code heredf_clean_three = df_clean_twodf_clean_three['proportion_weather_delays'] = (df_clean_three['weather_related_delays'] / df_clean_three['num_of_flights_total']) *100airport_weather_delays = df_clean_three.groupby('airport_code').agg(proportion_weather_delays=('proportion_weather_delays', 'mean')).reset_index()( ggplot(airport_weather_delays, aes(x='airport_code', y='proportion_weather_delays')) + geom_bar(stat='identity') + ggtitle('Proportion of Weather Delays by Airport') + xlab('Airport Code') + ylab('Proportion of Weather Delays (%)') + theme(axis_text_x=element_text(angle=45, hjust=1)))